ABSTRACT


The cost of system operational testing is steadily 
increasing. It is desirable for the software manager to know 
if the software is sufficiently well developed or reliable to 
support such testing. Current software reliability models
provide only point estimates of the mean time to next failure 
or expected number of errors to occur in additional testing 
time. 

The goal of this thesis is to take into account prediction 
uncertainties of a software reliability model. Bootstrapping 
is used to provide the software manager with confidence limits
of the predicted expected number of faults to occur for 
additional testing time. The results can be particularly 
useful to a software manager who has to answer a subjective 
question: is the software reliable enough to support system 
operational testing? A range of predicted expected number of 
faults will be of more use to a software manager, who has to 
justify the answer to this question, than just a point 
estimate. Two software fault data sets are analyzed with this
technique, with emphasis on how a software manager should analyze
the results.
I. INTRODUCTION

Prior to costly operational testing of a system consisting 
of hardware and its embedded software, it would be highly 
desirable to know whether these two major components are 
sufficiently reliable to support such testing. Specifically, 
this is equivalent to asking whether the software has reached 
a state of maturity such that unforeseen faults (bugs, errors, 
system crashes, etc.) are not likely to occur during 
operational test of the entire system, or later, during a 
systemic mission. 

Estimation of hardware reliability is relatively well- 
understood. Unfortunately, software reliability or maturity 
prediction is not as well understood at this time. The 
ANSI/IEEE definition of software reliability is the ability of 
a program to perform a required function under stated 
conditions for a stated period of time (IEEE, 1984). Since 
testing software has an associated cost whether it is in 
computer run time, labor costs, lost market share resulting 
from late delivery of a product or, in the case of military 
equipment, sacrificed range-testing time and aborted missions, 
there is a finite time allocated for testing and removal of 
faults (bugs). A moderate-sized program with 264 branches
would have 2^264 independent paths (greater than the estimated
number of atoms in the universe). Obviously, it is infeasible
to test each path (Dalal and Mallows, 1988). Testing and
debugging costs are estimated to range from 50% to 80% of the 
costs for development of a working version of software 
(Beizer, 1984). The constraints of a finite time period for 
testing and the cost of testing are excellent incentives for 
prompt and accurate determination of software reliability. 
Put in the form of a question: when can testing be stopped 
and the product delivered with a high level of confidence that 
the customer will be satisfied? 

Software reliability estimation is based on the results of 
testing. Software testing can be broken down into four major 
categories: unit, integration, system and regression testing. 
Unit testing is usually done by the programmer in an informal
manner. Integration testing is done in an orderly progression 
such that the software elements are combined and tested until 
the entire software package has been tested. System testing 
is integration of hardware and software to verify that the 
system meets specified requirements. Regression testing is 
retesting to detect faults that may have been introduced 
during program modification (Hernandez, 1989). One purpose of 
testing is to produce quantitative measures of software error- 
proneness after effort has been expended in the integration 
testing, system testing, and fault removal phases. 

Software testing, a follow-on to hardware reliability
prediction, has been of considerable importance and interest
from the mid-1960’s to the present. The Navy’s Operational
Test and Evaluation Force recently (January, 1992) held a
symposium for DoD agencies to discuss and exchange ideas and 
methodologies on software testing and reliability. There are 
two basic differences between hardware and software 
reliability predictions. Hardware prediction usually assumes 
independence of failures, and, after some point, the 
reliability measuring process does not affect the failure 
rate. Software reliability prediction models should assume 
interdependence of unit failures, and that testing improves 
reliability. Removing a program fault or bug during 
developmental testing reduces the likelihood that a fault will 
become operative later in an operational setting that will 
cause a mission to abort. The software fault-prevalence and 
appearance prediction problem has been judged to be inherently 
more difficult than hardware reliability prediction (Beizer, 
1984). 

There are several software reliability models that will be 
discussed later. Beizer, in his seminal work Software System
Testing and Quality Assurance (Beizer, 1984), summed up the
similarities of the models best:

1. Most models assume a fixed but unknown number of
faults when testing.

2. Faults are universally assumed to be independent (some
of the later models, Schneidewind’s Software Reliability
Model for example, do not necessarily make this
assumption).

3. Most models assume perfect debugging. That is, the
debugging process introduces no new faults. However, some
of the later models take into account that not all
detected faults will be fixed, and that the debugging
process itself may introduce new faults (Littlewood and
Verrall’s Bayesian Reliability Growth Model takes into
account imperfect debugging).

4. Most models assume that test time and calendar time
are the same.

5. The models assume that failure rate is proportional to
the faults remaining. This implicitly means that faults
are assumed to cause single failures and each failure can
be related to one fault.

6. The models assume path homogeneity. That is, data are
entered randomly and such data uniformly exercise all
code. This is in direct contradiction to the reality that
most paths cover a small percentage (say under 10%) of
the code.

The difference between the models lies in the degree to
which these assumptions hold true, i.e. the type of random
process according to which the failures occur, and how data are
fitted to the models (Beizer, 1984).

The models that are described in Chapter II do not 
necessarily perform well for all types of data. There is no
"silver bullet" (Brooks, 1986) that will take on all comers
successfully. One model may predict reliability well for one 
data source but not another. The users of the models must 
take into consideration the predictive quality of a model 
prior to basing decisions on the output of the model (Abdalla 
et al, 1986) and (Goel, 1985). One possible way to do this is 
to analyze the data using various models. The manager selects
the model that demonstrates the best predictive qualities,
i.e. the model that appears to best fit the data and provide
useful results. The choice is difficult because it is
conducted in an atmosphere of uncertainty.

Our hypothesis is that software reliability can be 
predicted, but with error. It is important to take account of 
the variabilities and uncertainties that are inevitably 
present, at least those associated with sampling (finite
data); the most serious errors, however, may be associated with
model choice. To test this hypothesis of predictability we
analyze sources of fault (error or bug) data using a 
modification of the BELLCORE MODEL (Dalal and Mallows, 1988) 
to estimate the reliability of the particular software project 
and the quality of the prediction produced by the model. 
Parametric estimates are made by maximum likelihood but also 
by use of an approximate Bayesian technique. Error estimates 
are made by a re-sampling technique known as bootstrapping. 

The parametric bootstrap technique was used in the 
aftermath of the Challenger disaster to analyze the O-rings 
that failed. Although that analysis was done on hardware, the
methodology that we propose in Chapter III and the appendix is
similar. The analysis of the O-rings showed that the bootstrap 90%
confidence limits gave an expected catastrophic failure rate of at
least 13% at temperatures of less than 31 degrees, but less
than a 2% failure rate at temperatures above 60 degrees (Dalal
et al, 1989). Had the NASA decision makers had this
information available to them, the option of postponing
the launch might have been taken more seriously and the disaster
prevented. The analogous question for the software manager to consider
is whether the predicted number of faults to occur in some specified
time is acceptable. It is hoped that the wrong decision will
not have consequences as severe as the Challenger disaster.
The techniques that we describe provide a quantitative tool 
for the software manager to substantiate the decision to 
schedule (postpone) system operational testing. 

In Chapter II, we briefly describe several software 
reliability prediction models that have been proposed in order 
to provide a basis of understanding of the discussion. In
Chapter III and the appendix, we present the model fitting
procedure, the method used to determine the quality of the
prediction, the resulting data obtained from the analysis, and
methods to improve this methodology from the perspective of a
software manager. In Chapter IV, our conclusions are provided
and directions for future research are suggested.


II. SURVEY OF SOFTWARE RELIABILITY METHODOLOGIES

This survey is concerned with only two categories of
software reliability models: those for time between errors
(TBE), and those for fault count (number of errors in a specified
time).


A. TIME BETWEEN ERRORS (TBE) 

TBE reliability assessments attempt to predict the mean 
time between failure (MTBF) of the ith failure based on that 
to the (i-1)th failure. The TBE can be measured in either 
central processing unit (CPU) time or wall-clock time. Wall- 
clock time can be misleading: it can elapse regardless of 
whether or not the program is running. From this information 
the software manager can gain confidence that the software 
will exhibit the operational capability to complete its 
mission: to operate without failure for a mission time. A 
system that experiences multiple, severe software errors that 
prevent the system from completing its operational mission is 
not ready for costly live exercises as in operational testing. 
For example, a system that is supposed to detect, track and 
engage a missile during a scenario of five minutes’ duration, 
but whose software experiences a severe fault every thirty 
seconds on average, is obviously not ready to conduct an
expensive live exercise or actual mission. Here are some
models that attempt to predict (mean or average) time to
failure.
1. Jelinski and Moranda Model 
Jelinski and Moranda developed the "De-Eutrophication 
Model" (Moranda and Jelinski, 1972), (Farr, 1983). The 
assumptions are: 
• The rate of fault detection is proportional to the current
fault content of a program.
• All faults are equally likely to occur and are independent
of each other.
• Each fault is of the same severity as any other fault.
• The fault rate remains constant over the interval between
fault occurrences.
• The software is operated in a manner similar to
anticipated operational usage.
• The faults are corrected instantly, without introduction
of new faults into the program.


The hazard rate for the ith fault is

$$Z(t_i) = \phi\,[N-(i-1)] \tag{2.1}$$

where: N = total number of faults initially in the system
       i = the ith fault occurrence
       $\phi$ = proportionality constant.

$X_i = t_i - t_{i-1}$ is the time between the ith and the (i-1)st fault
and is assumed to have an exponential distribution with rate
$Z(t_i)$:

$$f(X_i) = \phi\,[N-(i-1)]\,e^{-\phi\,[N-(i-1)]\,X_i} \tag{2.2}$$

The likelihood function for the parameters $\phi$ and N is

$$L(X_1,\ldots,X_n) = \prod_{i=1}^{n} \phi\,[N-(i-1)]\,e^{-\phi\,[N-(i-1)]\,X_i} \tag{2.3}$$

Taking the partial derivatives of ln(L) with respect to N (N
is allowed to assume any real value as a convenient
approximation) and $\phi$, and then setting the equations equal to
zero, the solutions for the following set of equations are
obtained as maximum likelihood estimates for N and $\phi$ (N is
estimated by numerical techniques, then used to solve for $\phi$):

$$\hat{\phi} = \frac{n}{\hat{N}\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}(i-1)X_i} \tag{2.4}$$

$$\sum_{i=1}^{n}\frac{1}{\hat{N}-(i-1)} = \frac{n}{\hat{N} - \dfrac{1}{\sum_{i=1}^{n}X_i}\sum_{i=1}^{n}(i-1)X_i} \tag{2.5}$$

The estimate for the mean time between failure (MTBF) for the
(i+1)st fault occurrence is

$$\widehat{MTBF}_{i+1} = \frac{1}{\hat{Z}(t_{i+1})} = \frac{1}{\hat{\phi}\,(\hat{N}-i)} \tag{2.6}$$

The data required to use the Jelinski-Moranda model are the
observed times of the fault occurrence (the $t_i$’s), or the times
between the faults (the $X_i$’s).
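
The numerical solution mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the cited references: the function name `jm_fit`, the bisection search, and the search limits are hypothetical choices made here. The code solves equation (2.5) for N by bisection, then evaluates (2.4) and (2.6) with i = n.

```python
def jm_fit(x, tol=1e-10, max_iter=200):
    """Sketch: solve the Jelinski-Moranda equations (2.4)-(2.5) by bisection on N.

    x is the list of times between faults X_1..X_n (hypothetical input format).
    """
    n = len(x)
    total = sum(x)
    # c = sum((i-1) * X_i) / sum(X_i); enumerate() starts at 0, which is (i-1)
    c = sum(i * xi for i, xi in enumerate(x)) / total

    def g(N):
        # Left side minus right side of equation (2.5)
        return sum(1.0 / (N - i) for i in range(n)) - n / (N - c)

    lo = (n - 1) + 1e-9          # N must exceed n - 1; g is large and positive here
    hi = float(n)
    while g(hi) > 0:             # widen the bracket until g changes sign
        hi *= 2.0
        if hi > 1e9:
            raise ValueError("no finite estimate of N (data show no reliability growth)")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    n_hat = 0.5 * (lo + hi)
    phi_hat = n / (n_hat * total - c * total)            # equation (2.4)
    # Equation (2.6) with i = n; undefined if fewer than one fault is estimated to remain
    mtbf_next = 1.0 / (phi_hat * (n_hat - n)) if n_hat > n else float("inf")
    return n_hat, phi_hat, mtbf_next
```

Bisection is used only because it needs no derivative and the sign change of (2.5) is easy to bracket; any one-dimensional root finder would serve.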
2. Schick-Wolverton Model 
The hazard rate for the Schick-Wolverton model
(Schick and Wolverton, 1978) and (Farr, 1983) is proportional
to the number of faults in the program and the amount of
testing time. An assumption of the model is that as more
testing is completed the probability of detecting faults
increases because of "zeroing-in" on the areas of code where
the errors lie. The assumptions are:

• The rate of fault detection is proportional to the current
fault content and to the amount of time expended in
testing.
• All faults are equally likely to occur.
• All faults are independent of each other.
• All faults are of the same severity.
• The software is operated in a manner similar to the
anticipated operational usage.
• Perfect fault correction occurs.


The hazard function is

$$Z(X_i) = \phi\,[N-(i-1)]\,X_i \tag{2.7}$$

where: $X_i$ = the amount of time spent testing between the
       occurrence of the ith and the (i-1)st fault
       N = total number of faults initially in the program
       $\phi$ = proportionality constant.

The reliability function of $X_i$ is

$$R(X_i) = \exp\!\left(-\phi\,[N-(i-1)]\,\frac{X_i^2}{2}\right) \tag{2.8}$$

The density function of $X_i$ is

$$f(X_i) = \phi\,[N-(i-1)]\,X_i\,\exp\!\left(-\phi\,[N-(i-1)]\,\frac{X_i^2}{2}\right) \tag{2.9}$$

If $X_i^2/2$ is replaced by $Y_i$ the model is formally identical to
the Jelinski-Moranda model previously described. In fact, the
substitution of any known function of $X_i$ allows transformation
to the Jelinski-Moranda model. N and $\phi$ are estimated by
MLE’s:

$$\hat{\phi} = \frac{2n}{\sum_{i=1}^{n}[\hat{N}-(i-1)]\,X_i^2} \tag{2.10}$$

$$\sum_{i=1}^{n}\frac{1}{\hat{N}-(i-1)} = \frac{\hat{\phi}}{2}\sum_{i=1}^{n}X_i^2 \tag{2.11}$$

The estimate for the mean time between failure (MTBF) for the
(i+1)st occurrence is

$$\widehat{MTBF}_{i+1} = \sqrt{\frac{\pi}{2\,\hat{\phi}\,(\hat{N}-i)}} \tag{2.12}$$

The data requirements are the time of the fault occurrence, $t_i$,
or the time between the ith and (i-1)st fault.
3. Geometric Model 

The Geometric model (Moranda, 1975) and (Farr, 1983)
is a modification of the Jelinski-Moranda "De-Eutrophication"
model. It differs from that model as follows: it does not
assume a fixed number of faults in the program, and the faults
are not equally likely to occur because as debugging
progresses faults become harder to detect. The assumptions
are:
• There is an infinite number of total faults (the program
is never totally fault free).
• All faults do not have the same chance of detection.
• Detections of faults are independent.
• The software is operated in a manner similar to
anticipated operational usage.
• The fault detection rate forms a geometric progression and
is constant between faults.


The hazard rate for the ith fault is

$$Z(t) = D\,\phi^{\,i-1} \tag{2.13}$$

where: t = time between the ith and the (i-1)th fault
       D = initial hazard rate
       $\phi$ = fault detection rate ($0 < \phi < 1$)
       n = the nth fault to occur.

$X_i$ = time between the ith and the (i-1)st fault. The $X_i$ are
independently and exponentially distributed with rate $Z(t_i)$,
so the density function of $X_i$ is

$$f(X_i) = D\,\phi^{\,i-1}\exp\!\left(-D\,\phi^{\,i-1}X_i\right) \tag{2.14}$$

D and $\phi$ are estimated by MLE’s:

$$\hat{D} = \frac{\hat{\phi}\,n}{\sum_{i=1}^{n}\hat{\phi}^{\,i}X_i} \tag{2.15}$$

$$\frac{\sum_{i=1}^{n} i\,\hat{\phi}^{\,i}X_i}{\sum_{i=1}^{n}\hat{\phi}^{\,i}X_i} = \frac{n+1}{2} \tag{2.16}$$

Equation (2.16) is solved for $\phi$, and that value is substituted
into (2.15) to find D. From these equations the MTBF until
the (n+1)st fault occurs after n faults have occurred can be
obtained:

$$\widehat{MTBF}_{n+1} = E(X_{n+1}) = \frac{1}{\hat{D}\,\hat{\phi}^{\,n}} \tag{2.17}$$

The data requirements are the time of the ith fault ($t_i$), or
the time between the faults ($X_i$), for i = 1,2,...,n.
4. Use of Time Between Errors (TBE) Models

The TBE for models in this category can be measured in 
either wall-clock time or CPU time. The models may be used to 
predict the expected time to the next failure. Confidence 
limits on the expected value should be used to obtain a range 
of time to the next failure. The software manager should be
asking: is the expected time to the next failure longer
than the time required for operational testing of the software
within the overall system? If the time required for
operational testing of the system is greater than the mean
time to failure for the (i+1)th failure, then the prudent
software manager should consider postponing operational
testing in favor of continued developmental activity and
testing.


B. FAULT COUNT MODELS 
Fault count models use the number of faults that occurred 
in a testing interval to determine the expected number of 
faults in the next testing interval. Software managers can 
employ this method by simply counting the number of faults in
a given test period (e.g. day, week, or month), provided test
exposures are the same. This provides insight into how well
the testing process is working. 
1. Generalized Poisson Model 
The Generalized Poisson Model (Schafer et al, 1979), 
(Farr, 1983) is similar to the Jelinski-Moranda and Schick- 
Wolverton models but uses fault count observations in fixed, 
equal-length intervals rather than times between faults. The 
assumptions are: 
• The expected number of faults occurring in any time
interval is proportional to the fault content (number of
bugs remaining) at the time of testing, and to the amount
of time that has been previously spent in testing. The
actual number of faults that appear is assumed to be
Poisson distributed.
• All faults are equally likely to occur and are independent
of each other.
• Each fault is of the same severity.
• The software is operated in a manner similar to the
anticipated operational usage.
• The faults are corrected at the ends of the testing
intervals. (Note: Faults discovered in one test interval
may be corrected at another test interval; the only
restriction is that the fault correction come at the end
of the testing intervals.)

Testing intervals are of length $x_i$, and $f_i$ faults occur during
the ith interval. At the end of the ith interval a total of
$M_i$ faults are corrected.

The expected number of faults in the ith interval is

$$E(f_i) = \phi\,[N-M_{i-1}]\,g_i(x_1, x_2, \ldots, x_i) \tag{2.18}$$

where: $\phi$ = proportionality constant
       N = initial number of faults
       $g_i$ = function of the amount of testing time spent
       previously and currently and is nondecreasing;
       as testing progresses more faults are found

specifically,

$$g_i(x_1, x_2, \ldots, x_i) = x_i^{\alpha} \tag{2.19}$$

where $\alpha$ is assumed known.

$f_i$ is Poisson with mean $\phi\,(N-M_{i-1})\,g_i$. N and $\phi$ are estimated by
MLE’s:

$$\hat{\phi} = \frac{\sum_{i=1}^{n} f_i}{\sum_{i=1}^{n}(\hat{N}-M_{i-1})\,g_i} \tag{2.20}$$

$$\sum_{i=1}^{n}\frac{f_i}{\hat{N}-M_{i-1}} = \hat{\phi}\sum_{i=1}^{n} g_i \tag{2.21}$$

These non-linear equations must be solved for $\phi$ and N. From
this the expected number of errors in the (n+1)st test
interval can be obtained,

$$E(f_{n+1}) = \hat{\phi}\,(\hat{N}-M_n)\,x_{n+1}^{\alpha} \tag{2.22}$$

where: $x_{n+1}$ is the anticipated testing time for the (n+1)st
test interval.

The data requirements for this model are the lengths of the
test intervals ($x_i$), the total number of faults corrected at
the end of a test interval ($M_i$), and the number of faults
discovered in each interval ($f_i$).

2. Non-homogeneous Poisson Process Model 

The Non-homogeneous Poisson Process Model (NHPP) (Goel and
Okumoto, 1979) and (Farr, 1983) assumes that the fault counts
for testing intervals follow a Poisson distribution. The
expected number of faults in the Poisson process model is
proportional to the number of faults left in the program. The
assumptions are:

• The software is operated in a manner similar to the
anticipated operational usage.
• The numbers of faults detected, ($f_i$), in any test
interval, ($t_{i-1}$, $t_i$), are independent for any finite
collection of times $t_1 < t_2 < \ldots < t_n$.
• Faults are of the same severity.
• Faults are equally likely to be detected.
• The cumulative number of faults detected at any time t,
(N(t)), follows a Poisson distribution with mean m(t). The
mean, m(t), is the expected number of faults to occur for
any time period (0,t) and is proportional to the expected
number of undetected faults at time t.
• m(t) is bounded.


The specific mean function used is

$$m(t) = a\,(1-e^{-bt}) \tag{2.23}$$

and $f_i$ is the number of faults in the ith interval,

$$f_i = N(t_i) - N(t_{i-1}) \tag{2.24}$$

where: a = expected total number of faults to be
       eventually detected.

a and b can be estimated by MLE’s:

$$\hat{a} = \frac{\sum_{i=1}^{m} f_i}{1-e^{-\hat{b}t_m}} \tag{2.25}$$

$$\frac{t_m\,e^{-\hat{b}t_m}\sum_{i=1}^{m} f_i}{1-e^{-\hat{b}t_m}} = \sum_{i=1}^{m}\frac{f_i\,\left(t_i\,e^{-\hat{b}t_i} - t_{i-1}\,e^{-\hat{b}t_{i-1}}\right)}{e^{-\hat{b}t_{i-1}} - e^{-\hat{b}t_i}} \tag{2.26}$$

From the estimates of a and b the expected number of faults in
the next, (m+1)st, test interval is estimated to be

$$m(t_{m+1}) - m(t_m) = \hat{a}\,\left(e^{-\hat{b}t_m} - e^{-\hat{b}t_{m+1}}\right) \tag{2.27}$$

The data required for this model are the fault counts of each
test interval ($f_i$) and the time of each test interval ($t_i$).
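
As a small illustration of how equations (2.23), (2.25) and (2.27) are used once b is available, consider the following Python sketch. The function names are hypothetical and chosen here for readability; the estimate of b itself, which requires a numerical solution of (2.26), is assumed to be supplied by the caller.

```python
import math


def nhpp_mean(t, a, b):
    """Mean function m(t) = a(1 - exp(-b t)), equation (2.23)."""
    return a * (1.0 - math.exp(-b * t))


def nhpp_expected_faults(t_start, t_end, a, b):
    """Expected number of faults between t_start and t_end, as in equation (2.27)."""
    return a * (math.exp(-b * t_start) - math.exp(-b * t_end))


def nhpp_a_hat(counts, t_m, b):
    """Equation (2.25): estimate of a, given b and the total test time t_m."""
    return sum(counts) / (1.0 - math.exp(-b * t_m))
```

For example, with a = 100 and b = 0.1 (values invented for illustration), `nhpp_expected_faults(10, 11, 100.0, 0.1)` gives the expected fault count for the interval following ten time units of testing.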
3. Schneidewind’s Software Reliability Model 
Schneidewind’s model (Schneidewind, 1975) and (Farr,
1983) maintains that as testing progresses the fault detection
process changes. The later faults are therefore more useful
in determining future fault counts. The model allows for
three approaches.

1. Utilize all the fault counts from the m intervals.

2. The first (s-1) intervals are ignored and only the
fault counts from intervals s through m are considered.

3. The fault counts of the first (s-1) intervals are summed, and
the fault counts from the remaining intervals s through
m are treated individually. Denote the sum of
the fault counts in the first s-1 intervals by:

$$F_{s-1} = \sum_{i=1}^{s-1} f_i \tag{2.28}$$

Method 1 is used when the analyst feels that all intervals
will be useful. Method 2 can be used when a significant
change in the fault detection process has occurred at
approximately the (s-1)st interval. Method 3 attempts to
combine the effects of both approaches. The assumptions for
all methods are the same:

• The fault counts for each interval are independent of each
other.
• The fault correction rate is proportional to the number of
faults to be corrected.
• The software is operated in a manner similar to the
anticipated operational usage.
• The mean number of detected faults decreases from one
interval to the next.
• Intervals are all of the same length.
• The rate of fault detection is proportional to the number
of faults remaining. The fault detection process is
assumed to be a non-homogeneous Poisson process with an
exponentially decreasing appearance and detection rate.


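The rate of change of the number of faults detected in the ith
interval is

$$d_i = \alpha\,e^{-\beta i} \tag{2.29}$$

The cumulative mean number of faults that occurs up to and
including interval i is

$$D_i = \frac{\alpha}{\beta}\,\left(1-e^{-\beta i}\right) \tag{2.30}$$

The mean number of faults for the ith interval is

$$m_i = D_i - D_{i-1} = \frac{\alpha}{\beta}\,\left(e^{-\beta(i-1)} - e^{-\beta i}\right) \tag{2.31}$$

$\alpha$ and $\beta$ can be determined by MLE’s:

$$\hat{\beta} = \ln(y) \tag{2.32}$$

$$\hat{\alpha} = \frac{\hat{\beta}\,F_m}{1-e^{-\hat{\beta}m}} \tag{2.33}$$

For Method 1, y is the solution to:

$$\frac{1}{y-1} - \frac{m}{y^m-1} = \frac{A}{F_m} \tag{2.34}$$

where:

$$A = \sum_{i=0}^{m-s}(s+i-1)\,f_{s+i} \quad \text{(with } s = 1 \text{ under Method 1)} \tag{2.35}$$

$$F_m = \sum_{i=1}^{m} f_i \tag{2.36}$$

For Method 2, y is the solution to

$$A\,y^{m-s+2} - (A+F_{s,m})\,y^{m-s+1} + \left((m-s+1)F_{s,m} - A\right)y + \left(A + F_{s,m} - (m-s+1)F_{s,m}\right) = 0 \tag{2.37}$$

where:

$$A = \sum_{i=0}^{m-s} i\,f_{s+i} \tag{2.38}$$

$$F_{s,m} = \sum_{i=s}^{m} f_i \tag{2.39}$$

$$\hat{\alpha} = \frac{\hat{\beta}\,F_{s,m}}{1-e^{-\hat{\beta}(m-s+1)}} \tag{2.40}$$

For Method 3, y is the solution to

$$\frac{(s-1)\,F_{s-1}}{y^{s-1}-1} + \frac{F_{s,m}}{y-1} - \frac{m\,F_m}{y^m-1} = A \tag{2.41}$$

where: A is the same as Method 1 and $F_{s,m}$ is the same as Method
2. From the MLE’s of $\alpha$ and $\beta$ the expected number of faults in
the (m+1)st interval is

$$E(f_{m+1}) = \frac{\hat{\alpha}}{\hat{\beta}}\,\left(e^{-\hat{\beta}m} - e^{-\hat{\beta}(m+1)}\right) \tag{2.42}$$

The time needed to detect a total number of M faults is

$$t_M = \frac{1}{\hat{\beta}}\,\ln\!\left[\frac{\hat{\alpha}}{\hat{\alpha} - \hat{\beta}M}\right] \tag{2.43}$$
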
The data needed for this model are the fault counts for each 
interval and a history of testing process in order to 
determine the interval that testing procedures may have 
altered significantly. 
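
To show how the fitted parameters might be used, the following Python sketch evaluates the cumulative mean (2.30), the next-interval prediction (2.42) and the time to detect M faults (2.43). The function names and the error check are hypothetical conveniences added here; the parameter values in any call are the analyst's estimates, and nothing in the sketch estimates them.

```python
import math


def cumulative_faults(i, alpha, beta):
    """Cumulative mean number of faults through interval i, equation (2.30)."""
    return (alpha / beta) * (1.0 - math.exp(-beta * i))


def expected_next_interval(m, alpha, beta):
    """Expected number of faults in the (m+1)st interval, equation (2.42)."""
    return (alpha / beta) * (math.exp(-beta * m) - math.exp(-beta * (m + 1)))


def time_to_detect(total_faults, alpha, beta):
    """Time needed to detect total_faults faults, equation (2.43)."""
    if beta * total_faults >= alpha:
        # The model predicts at most alpha/beta faults in unlimited testing time
        raise ValueError("total_faults must be smaller than alpha / beta")
    return math.log(alpha / (alpha - beta * total_faults)) / beta
```

The guard in `time_to_detect` reflects that the cumulative mean in (2.30) never exceeds alpha/beta, so a request for more faults than that has no finite answer.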
4. Use of Fault Count Models 

Fault count models use the number of faults that occur 
in some testing interval. The models in this category predict 
the expected number of faults to occur in some additional time 
interval. Confidence limits on the expected number should be 
used to obtain a range of the predicted number of faults to 
occur for that time interval. Since there can never be a one 
hundred percent guarantee of perfect software, the software 
manager should be asking: is the predicted number of faults 
to occur for the time interval of interest acceptable for 
operational testing? If the predicted number of faults to
occur is too great then the prudent software manager should
postpone operational testing in favor of continued
developmental activity and testing.


C. SOFTWARE RELIABILITY MODELS 
The number of software reliability models continues to
grow. Assumptions have broadened to reflect the reality of
the software development process with increased accuracy. The
assumptions of some models described appear to be limiting.
The assumption that faults are all of the same severity can be
worked around by modeling faults according to severity. The assumption that
all faults are equally likely to occur and independent of each
other can be resolved by assuming low severity faults occur
more frequently than high severity faults, but faults of the
same severity class will be considered equally likely to
occur. The assumption of instantaneous fault correction can be avoided by not
counting faults which were previously detected (and counted at
time of initial detection), but were not corrected (Farr,
1983).

Software managers need to be aware of the limitations and
the assumptions that underlie the various models that
are available. The data that is needed to fit the models is
critical to reliable results. The data collection needs to be 
an accurate reflection of the meaningful historical testing of 
the software. Some of the data that should be collected is 
computer usage time, testing intensity, extent of the software 
that was tested (was the entire system tested or just a 
particular module), and milestones in the software’s 
development (are requirements changed or added midway through 
the development of the software?) and, of course, the cost of 
testing. 

This study illustrates the use of a particular reliability
model. Some of the specific questions that this thesis
addresses are: How is a software reliability model used?
What type of information does a model require? What kind of
decision can a software manager make based on the results of
the reliability model?

In today’s fiscal environment software managers should
have a "warm fuzzy feeling" substantiated by quantitative
results for their product prior to initiating costly full
scale, live operational testing.
III. DATA ANALYSIS


A. MODEL DEVELOPMENT 

The model that is applied in this thesis is based on the
assumption that the rate of error occurrence is a non-stationary
Poisson process (NSPP) (Dalal and Mallows, 1988).
The model is identical to the Schneidewind model, and is
fitted according to Method 1, which assumes that all fault
data is of equal value. Let N(t) be the number of faults that
occur in (0,t), where t is software running time. The
probability that n faults occur by time t is
given by:

$$P\{N(t) = n\} = \frac{e^{-\Lambda(t)}\,[\Lambda(t)]^{n}}{n!} \tag{3.1}$$

where $\Lambda(t) = \lambda\,(1-e^{-\mu t})$. A test time, $t_0$, was chosen. This length
of time is divided into periods of length $\Delta = t_0/J$, where J is
the total number of intervals. The jth interval is such that
$(j-1)\Delta < t < j\Delta$. The number of observed counts (faults) in the
jth interval is $n_j$. The probability distribution for the
number of faults in $[(j-1)\Delta, j\Delta]$ is

$$P\{N_j = N(j\Delta) - N((j-1)\Delta) = n_j\} = \frac{e^{-\lambda_j}\,\lambda_j^{\,n_j}}{n_j!}, \qquad n_j = 0, 1, 2, \ldots \tag{3.2}$$

where

$$\lambda_j = E[N_j] = \lambda\,(1-e^{-\mu j\Delta}) - \lambda\,(1-e^{-\mu (j-1)\Delta}) \tag{3.2a}$$

$$= \lambda\,e^{-\mu (j-1)\Delta}\,(1-e^{-\mu\Delta}) \tag{3.2b}$$

The parameters $\mu$ and $\lambda$ are estimated by maximum likelihood.
The likelihood function is

$$L(\lambda,\mu) = \prod_{j=1}^{J}\frac{e^{-\lambda_j}\,\lambda_j^{\,n_j}}{n_j!} \tag{3.3}$$

The natural log of $L(\lambda,\mu)$ is

$$l(\lambda,\mu) = \ln(L) = -\sum_{j=1}^{J}\lambda_j + \sum_{j=1}^{J} n_j\ln(\lambda_j) - \sum_{j=1}^{J}\ln(n_j!) \tag{3.4}$$

The partial derivatives of $l(\lambda,\mu)$ with respect to $\lambda$ and $\mu$ are
taken and set equal to zero. This allows $\lambda$ to be written in
terms of $\mu$ and $n(t_0)$, the total number of counts to occur up
to time $t_0$, as

$$\hat{\lambda} = \frac{n(t_0)}{1-e^{-\mu t_0}} \tag{3.5}$$

$\hat{\lambda}$ is substituted into the partial derivative of l with respect
to $\mu$ to give,

$$-\frac{n(t_0)\,t_0\,e^{-\mu t_0}}{1-e^{-\mu t_0}} - \Delta\,M(J) + \frac{n(t_0)\,\Delta\,e^{-\mu\Delta}}{1-e^{-\mu\Delta}} = 0 \tag{3.6}$$

where

$$M(J) = \sum_{j=1}^{J}(j-1)\,n_j \tag{3.7}$$


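$\mu$ can now be solved for from the following equation:

$$\frac{\Delta\,e^{-\mu\Delta}}{1-e^{-\mu\Delta}} - \frac{t_0\,e^{-\mu t_0}}{1-e^{-\mu t_0}} = \frac{\Delta\,M(J)}{n(t_0)} \tag{3.8}$$

This equation closely resembles Schneidewind’s result; see
(2.41). Since $t_0 = \Delta J$, equation (3.8) becomes

$$\frac{e^{-\mu\Delta}}{1-e^{-\mu\Delta}} - \frac{J\,e^{-\mu\Delta J}}{1-e^{-\mu\Delta J}} = \frac{M(J)}{n(t_0)} \tag{3.9}$$

then,

$$\frac{e^{-\mu\Delta}}{1-e^{-\mu\Delta}} - \frac{J\,(e^{-\mu\Delta})^{J}}{1-(e^{-\mu\Delta})^{J}} = \frac{M(J)}{n(t_0)} \tag{3.10}$$

By letting $x = e^{-\mu\Delta}$, equation (3.10) becomes,

$$\frac{x}{1-x} - \frac{J\,x^{J}}{1-x^{J}} = \frac{M(J)}{n(t_0)} = r \tag{3.11}$$

x is solved for iteratively. Let $J = \infty$ for the first iteration,
so that the second term of (3.11) vanishes; then

$$\frac{x(1)}{1-x(1)} = \frac{M(J)}{n(t_0)} = r(1) \tag{3.12}$$

$$x(1) = \frac{r(1)}{1+r(1)} = \frac{M(J)}{n(t_0)+M(J)} \tag{3.13}$$

r(2) is

$$r(2) = r(1) + \frac{J\,[x(1)]^{J}}{1-[x(1)]^{J}} \tag{3.14}$$

x(2) is given by,

$$x(2) = \frac{r(2)}{1+r(2)} \tag{3.15}$$

Hence the iteration of r(n) and x(n) is

$$r(n+1) = r(1) + \frac{J\,[x(n)]^{J}}{1-[x(n)]^{J}} \tag{3.16}$$

$$x(n+1) = \frac{r(n+1)}{1+r(n+1)} \tag{3.17}$$

The iterative process continues until $|x(n+1)-x(n)| < \epsilon$, where
$\epsilon$ is a suitably small number; x(n+1) is then substituted into
equation (3.5), using $e^{-\mu t_0} = x^{J}$, to get $\hat{\lambda}$. Using the estimates of $\mu$ and $\lambda$, the
expected number of faults to be observed in some additional
operating time $t_1$, where $(t_0, t_0+t_1)$ is of length $k\Delta$, can be
estimated

$$E[N(t_0+t_1) - N(t_0)] = \hat{\lambda}\,e^{-\hat{\mu}t_0}\,\left(1-e^{-\hat{\mu}k\Delta}\right) \tag{3.19}$$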
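
The fixed-point iteration (3.12) through (3.17) and the prediction (3.19) can be sketched in Python as follows. This is an illustration written for this section, not the program used to produce the thesis results: the function names, the tolerance, the iteration limit and the input check are assumptions. The rate estimate is recovered from the substitution $x = e^{-\mu\Delta}$ as $-\ln(x)/\Delta$.

```python
import math


def fit_nspp(counts, delta, tol=1e-12, max_iter=100000):
    """Sketch: estimate (lambda, mu) from interval fault counts n_1..n_J."""
    J = len(counts)
    n_total = sum(counts)                                     # n(t_0)
    # r(1) = M(J) / n(t_0), with M(J) = sum((j-1) n_j) from equation (3.7)
    r1 = sum(j * nj for j, nj in enumerate(counts)) / n_total
    if not 0.0 < r1 < (J - 1) / 2.0:
        # The left side of (3.11) only takes values strictly between 0 and (J-1)/2
        raise ValueError("counts do not show a decreasing fault rate")
    x = r1 / (1.0 + r1)                                       # equation (3.13)
    for _ in range(max_iter):
        r = r1 + J * x**J / (1.0 - x**J)                      # equation (3.16)
        x_new = r / (1.0 + r)                                 # equation (3.17)
        converged = abs(x_new - x) < tol
        x = x_new
        if converged:
            break
    mu = -math.log(x) / delta                                 # from x = exp(-mu * delta)
    lam = n_total / (1.0 - x**J)                              # equation (3.5), t_0 = J * delta
    return lam, mu


def expected_additional_faults(lam, mu, t0, k, delta):
    """Expected faults in the k intervals following t0, equation (3.19)."""
    return lam * math.exp(-mu * t0) * (1.0 - math.exp(-mu * k * delta))
```

The iteration converges slowly when x is close to 1 (a nearly constant fault rate), which is why the iteration limit in this sketch is generous.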


A Bayesian methodology is discussed in the appendix. This
method attempts to utilize past experience from software
projects having similar characteristics as the software in
question. If the distributions of $\lambda$ and $\mu$ are known from
experience then this information can be useful in estimating
the parameters $\lambda$ and $\mu$.


B. BOOTSTRAP 

Bootstrapping was used to obtain the confidence limits for
λ, μ, and E[N(t_e+t_a)-N(t_e)] = E[ΔN(t_a)]. This technique takes
into account the sampling uncertainties in the estimates by
removing the errors in the standard approximation (Dalal et
al, 1989) and (Efron, 1985). To obtain the estimates of the
sampling variability of μ, λ, and E[ΔN(t_a)], proceed as
follows. The probability that a count occurs in the jth
period is conditional on N(t_e) = n(t_e):

$$P\{N_1=n_1,\ldots,N_J=n_J \mid N_1+N_2+\cdots+N_J=n(t_e)\} \tag{3.20a}$$

$$= n(t_e)!\,\prod_{j=1}^{J}\frac{1}{n_j!}\left(\frac{\Lambda_j}{\sum_{i=1}^{J}\Lambda_i}\right)^{n_j} \tag{3.20b}$$

where ΣΛ_j = 1-e^{-μJΔ}. From this the probability that a count
falls in the jth interval is

$$p_j = \frac{e^{-\mu(j-1)\Delta}\left(1-e^{-\mu\Delta}\right)}{1-e^{-\mu J\Delta}}$$

Uniform(0,1) random numbers U_k were generated, k=1,2,...,n(t_e),
where U_k is the kth random number. If P_{j-1} < U_k ≤ P_j, where
P_j = p_1 + ... + p_j is the cumulative probability (P_0 = 0), then
a count is added to n_j. The simulated n_j's were then used to
re-estimate μ, λ, and E[ΔN(t_a)]; these are the bootstrap
values. This process was repeated 1000 times to get a range
of values for μ, λ, and E[ΔN(t_a)]. To create a 90% confidence
limit of the estimate E[ΔN(t_a)], the 1000 bootstrap estimates
of E[ΔN(t_a)] were ordered and the 50th and 950th ordered
values were found. These are quoted as the 90% confidence
region (E[ΔN(t_a)]_.05, E[ΔN(t_a)]_.95).
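The resampling steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis code: the likelihood equation for x = e^{-μΔ} is solved by bisection rather than by the iteration of equations (3.16) and (3.17), replicates for which no decaying fit exists are skipped, and the function names and the fault counts used in the usage note are hypothetical.

```python
import bisect
import math
import random


def fit(counts, delta):
    """MLE of (lam, mu) from per-period fault counts; None if no decay is detectable."""
    J, n = len(counts), sum(counts)
    target = sum(j * c for j, c in enumerate(counts)) / n   # sum of (j-1)*n_j over n(t_e)
    g = lambda x: x / (1 - x) - J * x**J / (1 - x**J)       # increasing in x on (0, 1)
    lo, hi = 1e-9, 1 - 1e-6
    if not g(lo) < target < g(hi):
        return None
    for _ in range(200):                                    # bisection for x = exp(-mu*delta)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < target else (lo, mid)
    x = 0.5 * (lo + hi)
    return n / (1 - x**J), -math.log(x) / delta             # (lam, mu)


def expected_additional(lam, mu, t_e, t_a):
    """Expected faults in (t_e, t_e + t_a), as in equation (3.19)."""
    return lam * math.exp(-mu * t_e) * (1 - math.exp(-mu * t_a))


def bootstrap_interval(counts, delta, t_a, reps=1000, seed=0):
    """90% percentile interval for the expected number of additional faults."""
    rng = random.Random(seed)
    J, n = len(counts), sum(counts)
    lam, mu = fit(counts, delta)        # assumes the observed data admit a decaying fit
    x = math.exp(-mu * delta)
    # Cumulative interval probabilities P_j = (1 - x^j) / (1 - x^J).
    cum = [(1 - x ** (j + 1)) / (1 - x**J) for j in range(J)]
    cum[-1] = 1.0
    estimates = []
    for _ in range(reps):
        sim = [0] * J
        for _ in range(n):
            sim[bisect.bisect_left(cum, rng.random())] += 1   # P_{j-1} < U <= P_j
        refit = fit(sim, delta)
        if refit is not None:           # skip replicates with no interior solution
            estimates.append(expected_additional(refit[0], refit[1], J * delta, t_a))
    estimates.sort()
    m = len(estimates)
    # With m = 1000 these are the 50th and 950th ordered values.
    return estimates[max(int(0.05 * m) - 1, 0)], estimates[max(int(0.95 * m) - 1, 0)]
```

For example, `bootstrap_interval([20, 16, 13, 10, 8, 7, 5, 4, 3, 3], delta=100.0, t_a=250.0)` returns a lower and an upper limit for hypothetical counts observed in ten periods of 100 time units each. Fixing the seed makes the interval reproducible between runs.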


C. RESULTS 

The estimates for the parameters were obtained using three
different Δ values and three different t_a values. The value
t_e was selected such that t_e + t_a = time of last observed fault
to occur; this allows for comparison of the predicted expected
number of faults to occur with the observed data. The data
provided in Tables 1 through 6 are the 90% confidence intervals
obtained by the bootstrap.

The most difficult aspect of this thesis research was
obtaining appropriate test data. The data that I received
from various sources were unacceptable for various reasons: no
testing history, severity of faults not listed, no milestone
events listed (e.g. one data set covered 10 years but gave no
indication of modifications to the software), non-software
errors listed with software errors, description of errors
could not be interpreted (which may have eliminated some of
the problems mentioned above). The underlying cause of this
is that the organizations that I contacted for data do not use
any systematic method for determining software reliability.
A "warm fuzzy feeling" for the software seems to be the
current method used to judge the reliability of the software.
This feeling gets warmer and fuzzier as deadlines draw closer.

The data sets used in the analysis of the model were obtained
from a technical report on other software reliability models
(Abdalla et al., 1986). The data were given as time (CPU)
between failures. The results of the bootstrap for Data Set 1
are given in Tables 1-3; the graphical results (Dalal, 1990)
are depicted in Figures 1-3. The results of the bootstrap for
Data Set 2 are given in Tables 4-6; the graphical results
(Dalal, 1990) are depicted in Figures 4-6.


D. USE OF RESULTS 

Suppose a time t_e has been spent testing the software, and
n(t_e) faults were found. The n(t_e) faults can be broken up
into n_j's, the number of faults in each period j of size Δ
(JΔ = t_e). This information can be used to estimate the
parameters μ and λ, and to obtain a point estimate of the mean
or expected number of faults to appear in the time interval
(t_e, t_e+t_a). Operational testing of the system will require
some time t_a. Bootstrapping can now be done to assess the
sampling uncertainty in the estimate of the expected number of
faults to appear in (t_e, t_e+t_a). This will be done by quoting
bootstrapped 90% confidence limits. The expected number of
faults predicted to occur can be compared with the
requirements of the system; for example, for some time t_a at
most F faults are allowed (suppose F can be specified). If the
predicted expected number of faults is less than the allowable
number of faults, then system operational testing might be
worth the expense at this time. In contrast, if the expected
number of faults is greater than the specified number of
faults, then system operational testing should be postponed.
Testing should continue in the lab, at the developmental
level, until t_e and n(t_e) are large enough that the expected
number of faults for the required operational time meets
specification.

A more conservative approach is to replace the estimate of
the mean number of faults by the upper confidence limit of the
mean number of faults. Such a conservative approach is
recommended.
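The conservative rule just described amounts to a single comparison. The sketch below is only an illustration: the function name is hypothetical, the requirement F = 10 is an assumed value, and the two upper limits are the ones quoted for Data Set 1 in Section E.

```python
def operational_test_decision(upper_limit, max_allowed_faults):
    """Return 'schedule' if the upper confidence limit is below F, otherwise 'postpone'."""
    return "schedule" if upper_limit < max_allowed_faults else "postpone"


# F = 10 is a hypothetical requirement; 6.09 and 22.22 are upper limits quoted in Section E.
print(operational_test_decision(6.09, 10))    # t_e=1250, t_a=250
print(operational_test_decision(22.22, 10))   # t_e=500,  t_a=1000
```

Using the upper limit rather than the point estimate means a wide interval counts against scheduling, which is the intent of the recommendation.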

If there are no specifications, the individual responsible
for scheduling system operational testing will have to make a
subjective decision. Is the expected number of faults to
occur in (t_e, t_e+t_a) small enough to warrant spending the
money to carry out system operational testing, or should this
testing be postponed until the expected number of faults is
lower? The assumption is that lab testing will continue on
the software, increasing t_e and n(t_e), but reducing the number
of unfound and uncorrected faults. The more faults found in
lab testing of the software, the fewer the faults that are
likely to occur in the more costly system operational testing.

E. APPLICATION TO TWO DATA SETS

The fitting and error assessment procedure was applied to
two data sets (Abdalla et al., 1986). Figures 1, 2, and 3
refer to Data Set 1; Figures 4, 5, and 6 to Data Set 2.

Figure 1 has a Δ of 10 CPU minutes with three combinations
of t_e and t_a. If the range of the expected number of faults
for t_e=1250, t_a=250 (2.21 to 6.09) is acceptable, the software
manager may choose to schedule operational testing. The same
argument can be made for t_e=1000, t_a=500. A problem occurs for
t_e=500 and t_a=1000. If the range for the expected number of
faults to occur (4.69 to 22.22) is acceptable, the software
manager may choose to schedule operational testing.
Unfortunately, 46 faults occur in (t_e, t_e+t_a). This is
extremely likely to be the result of use of an inappropriate
model (it does seem unlikely that software with as many as 22
mission-critical faults would be viewed as acceptable for
starting operational testing). What can the software manager
do to prevent something like this from occurring? Ideally, as
testing continues, the rate at which faults occur should
decrease (assuming a constant relative rate of testing), with
that rate asymptotically approaching zero as t_e becomes large.
The slope of the estimated total expected number of faults
versus test time for Data Set 1 from T=300 to T=500 is m=0.08
(faults/cpu min). Figure 1 depicts this: the rate at which
faults are occurring does not appear to be tapering off. The
software manager can use this information to support a
decision to go ahead with (or postpone) operational testing.
From T=1000 to T=1500 the slope is 0.028 (faults/cpu min) and
appears to be tapering off. The range of the expected number
of faults to occur in the specified t_a accurately reflects what
actually occurred. If the range of the expected number of
faults is acceptable, the software manager should go ahead with
operational testing. Figure 2 (Δ = 20 cpu minutes) and Figure
3 (Δ = 50 cpu minutes) can be interpreted similarly.
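The slopes quoted above are taken from the fitted curve. A rough empirical counterpart can be computed from the failure times themselves, as sketched below. The function name and the failure times are hypothetical; the 14 failures in the window mirror the observed count reported in Table 2 for T=1000 to T=1500.

```python
def windowed_fault_rate(failure_times, t_start, t_end):
    """Observed faults per unit time in the window (t_start, t_end]."""
    count = sum(1 for t in failure_times if t_start < t <= t_end)
    return count / (t_end - t_start)


# Hypothetical cumulative failure times (CPU minutes): 14 failures between 1000 and 1500.
times = [1000 + 35 * i for i in range(1, 15)]
print(windowed_fault_rate(times, 1000, 1500))
```

Comparing this rate over successive windows gives a simple check on whether fault occurrence is tapering off before the model output is relied upon.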

The change in Δ for both data sets did not have a
significant impact on the range of the expected number of
faults to occur, indicating that the model is somewhat
insensitive to the size of Δ.

Data Set 2 (Figures 4, 5, and 6) shows only a small
indication of the slope decreasing. This is why the
confidence limits of the expected number of faults are so wide.
The software manager can apply the same techniques listed
above to make a decision to schedule (or postpone) operational
testing. The software manager must repeatedly address the
questions: is the rate of occurrence of faults lessening, and
is the range of expected number of faults acceptable to
support operational testing?

A fitted model may indicate a narrowing range of expected
number of faults and a slope asymptotically approaching zero;
consequently the software manager schedules operational
testing. Unfortunately, the results of the operational
testing may be poor, i.e. a relatively large number of errors
may occur, indicating that more developmental activity and
testing are required to improve the software. For example, the
model predicts n(t_a)=22 for Data Set 1 (t_e=500, t_a=1000), but
the number of observed faults that occurred in t_a, 46, was
more than twice the predicted amount. This example illustrates
the relationship between modeling and testing. While a
systematic underestimation indicates flaws in the model,
occasional underestimation simply reinforces that software
reliability models do not take the place of stressing software
within a full system in a real-life operational environment.
The purpose of this thesis is to provide the software manager
with a tool to aid in the decision as to when to initiate
operational testing, not to replace such a test.

TABLE 1
ESTIMATE OF PARAMETERS FOR DATA SET 1
t_e=1250, t_a=250 (CPU MINUTES)
Observed number of bugs in t_a is 6
90% Confidence Interval

Δ (cpu min)   Percentile   μ          λ             E[ΔN(t_a)]
10            5%           0.00272    (illegible)   (illegible)
              95%          0.00176    146.509       (illegible)
20            5%           0.00270    134.993       (illegible)
              95%          0.00174    147.798       6.09
50            5%           0.00270    135.258       2.25
              95%          0.00175    148.142       (illegible)


TABLE 2
ESTIMATE OF PARAMETERS FOR DATA SET 1
t_e=1000, t_a=500 (CPU MINUTES)
Observed number of bugs in t_a is 14
90% Confidence Interval

Δ (cpu min)   Percentile   μ             λ             E[ΔN(t_a)]
10            5%           0.00298       (illegible)   (illegible)
              95%          0.00177       147.640       14.73
20            5%           0.00296       (illegible)   (illegible)
              95%          0.00176       (illegible)   (illegible)
50            5%           (illegible)   (illegible)   (illegible)
              95%          0.00175       150.549       14.96















TABLE 3
ESTIMATE OF PARAMETERS FOR DATA SET 1

TABLE 4
ESTIMATE OF PARAMETERS FOR DATA SET 2
t_e=800, t_a=300 (CPU SECONDS)
Observed number of bugs in t_a is 12
90% Confidence Interval

IV. CONCLUSION

Software reliability models are useful tools that managers
of software-intensive projects have at their disposal. The
bootstrapping technique provides the manager with a range for
the expected number of faults estimated to occur in some
additional operating time. The question is whether the upper
limit of the expected number of faults estimated to occur is
acceptable. The potential risks are additional cost for
further testing or late product delivery. The ideal case is
reliable software delivered on time and on budget.
Unfortunately, reality is rarely ideal. The software manager
must decide: is it better to deliver a product on time that
may be considered unreliable by the user and be sent back for
further testing, or to deliver a product late but of
acceptable quality to the user? The purpose of this thesis is
to provide a quantitative tool for the manager who may have to
make such qualitative decisions.

The use of software reliability models is not without
associated cost and risk. The data must be collected for
input to the model. Recommendations for the type of data that
should be collected are:

• Operating time between failures (CPU time is the best)
  (Musa and Okumoto, 1984).

• Calendar time between failures, although such times may
  not accurately reflect the opportunity for faults to
  reveal themselves (Musa et al, 1987).

• Testing history, e.g. how many people are involved in the
  testing effort.

• How the software was tested.

• Intensity of the software testing.

• Cost of testing, i.e. the cost to find and repair a fault
  before and after product delivery.

Without useful data a reliability model has little
practical use. The model presented in this thesis should be
validated using data from several Navy systems.

There are several areas for further research. How
accurate are the predicted confidence limits in this model?
What are the limits of applicability of this model? What
effect do inaccuracies (due to replacing observed data with
hypothesized data in cases where insufficient data are
available) have on the model, i.e. how robust is the model?
Further development of other software reliability models
should be pursued. Emphasis should be placed on obtaining
confidence limits rather than quoting only a point estimate
of the expected number of failures predicted to appear in
some additional testing time. These models should be verified
using data obtained from Navy software-intensive systems. It
is infeasible to test every possible branch in a large program
for faults. The software manager needs technical assistance
in identifying where effort and money should be spent to
deliver the best possible product. Will many faults in
portions of the software that are rarely used/reached cause
more problems for the user than a few faults in frequently
used/reached portions?

APPENDIX


Software projects may have similar characteristics such as
testing strategies or architecture, so that the information
obtained about the reliability of one software project may be
used to aid in the prediction of the reliability of another,
similar software project. This process can make use of
Bayesian methodology (Dalal and Mallows, 1990), (Farr, 1983).
If prior distributions of λ and μ are specified, then this
information can be used to help estimate the parameters λ and
μ; the posterior for these is


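$$p_{\lambda,\mu}(\lambda,\mu) = K\,L(\lambda,\mu)\,p_\lambda(\lambda)\,p_\mu(\mu), \tag{a.1a}$$

$$= K\,e^{-\lambda(1-e^{-\mu J\Delta})}\,\lambda^{n(t_e)}\prod_{j=1}^{J}e^{-\mu(j-1)\Delta n_j}\,\left(1-e^{-\mu\Delta}\right)^{n(t_e)}p_\lambda(\lambda)\,p_\mu(\mu), \tag{a.1b}$$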
where p_λ(λ) and p_μ(μ) are the prior distributions of λ and μ
estimated from another software project that has
characteristics similar to the software project currently
being tested. The simplest idea is to integrate out λ and
marginalize on μ, which yields:


oo 


Py (nh) =Kf "ee “9478 ph) dhe Wa (=e 18) ne 
The most convenient choice of p_λ(λ) is the (conjugate) Gamma:

$$p_\lambda(\lambda) = \frac{\alpha\,(\alpha\lambda)^{\beta-1}e^{-\alpha\lambda}}{\Gamma(\beta)}, \tag{a.3}$$

which when substituted into equation (a.2) yields the density,


DP, (pb) =Ke “pAn(t,) (enue) GH |e tiara =! ynlts) 5-aa_(@A) a 


0 r(p) ° 
(a.4a) 
Sige agate) 2) f eZznltls) *B-1 4, ; (aera) 
0 (a+1—-e PAT) 2 Ee) +P 


= Kg vanlt,) (1-6 7BA) Bee) 1 


as A 


Using an uninformative prior, α=0, β=0, and setting x=e^{-μΔ},


equation (a.4c) becomes 


al 


\yae-4) = Kt yl te) (ay oe 
Caealaae! 


yee: +B 


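The mode of the density is found from

$$l(x) = \ln(p_\mu(x)) = \tilde{n}(t_e)\ln x + n(t_e)\ln(1-x) - \left(n(t_e)+\beta\right)\ln\left(\alpha+1-x^{J}\right) \tag{a.6}$$

where ñ(t_e) = Σ (j-1) n_j, summed over j = 1, ..., J, is the
exponent of x in the likelihood. Taking the partial derivative
of equation (a.6) with respect to x and setting it to zero
yields:

$$\frac{x}{1-x} = \frac{\tilde{n}(t_e)}{n(t_e)} + \frac{n(t_e)+\beta}{n(t_e)}\cdot\frac{Jx^{J}}{\alpha+1-x^{J}} \tag{a.7}$$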


If α=β=0 equation (a.7) is the same as equation (3.11), which
gives the MLE.
Suppose m=E[λ] and σ² = Var[λ] in the prior, then α = m/σ² and
β = m(m/σ²). Equation (a.7) is

$$\frac{x}{1-x} = \frac{n(t_e)+m(m/\sigma^{2})}{n(t_e)}\cdot\frac{Jx^{J}}{(m/\sigma^{2})+1-x^{J}} + \frac{\tilde{n}(t_e)}{n(t_e)} \tag{a.8}$$


If λ is interpreted as the total number of faults in a
particular software project, then the number of faults is
discrete, so a discrete distribution should be used for the
prior, i.e. one could use a Poisson for the prior. However,
it is easier to work with a Gamma distribution. If the Gamma
distribution has the same mean and variance as a Poisson, then
equation (a.8) is (since m=σ²)

$$\frac{x}{1-x} = \frac{n(t_e)+m}{n(t_e)}\cdot\frac{Jx^{J}}{2-x^{J}} + \frac{\tilde{n}(t_e)}{n(t_e)}. \tag{a.9}$$

It is clear that the variance to mean ratio of the prior has a
strong influence on the effect of a prior estimate of the
mean.
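The moment matching used for the Gamma prior can be made concrete with a short calculation. In the sketch below the function name and the prior mean of 150 faults are hypothetical; only the relations α = m/σ² and β = m(m/σ²) come from the text.

```python
def gamma_prior_from_moments(mean, variance):
    """Return (alpha, beta) for a Gamma prior with the given mean and variance."""
    alpha = mean / variance          # rate parameter, alpha = m / sigma^2
    beta = mean * mean / variance    # shape parameter, beta = m * (m / sigma^2)
    return alpha, beta


# Hypothetical prior mean of 150 faults with two different variance-to-mean ratios.
print(gamma_prior_from_moments(150.0, 150.0))   # Poisson-like case, m = sigma^2
print(gamma_prior_from_moments(150.0, 600.0))   # more diffuse prior
```

When the variance equals the mean, α is 1 and β equals m, which is the case that leads to equation (a.9). A larger variance shrinks both α and β, so the prior mean carries less weight in equation (a.8).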

One Bayesian approach to estimation is to find the mean
(rather than the mode, or highest point of the posterior, as is
essentially done in the likelihood approach) of the
(approximate) posterior, 0≤x≤1. To obtain an approximate
posterior mode proceed as follows. If J is large, x^J is small
(for x<1), so expand in Taylor's series to get

$$p_\mu(x) \approx K^{*}\left[x^{\tilde{n}}(1-x)^{n} + \frac{n+\beta}{\alpha+1}\,x^{\tilde{n}+J}(1-x)^{n}\right], \tag{a.10}$$

where n=n(t_e) and ñ=ñ(t_e).

Equation (a.10) is a convex combination of two beta densities.

K* can be found by setting the left-hand side of (a.11) equal
to 1.

E[x] = E[e^{-μΔ}] can be found,


xeker (Ltn (m1) n+1 


= = (a.11) 
Pin+na+1) nen 
1+o P(n+J+n+1) n+J+n+1 
The approximation to this is 
n! loll 5. bem Geneeo ae renewal 
= = a = 
pee) Le (n+J+n) 1 ntJ+ntl _ pa (a.12) 


nN! Heel’ (n+J) ! 


(n+n)! 1+@ (n+J+n)! 


Unfortunately, n=n(t,)=136 for Data Set 1; even with factoring 


out n=n(t,), the factorial ratios are on the order of 10°”. 


However, it is justifiable to use an approximation to the 


factorials to get 


geal |. ser) eke ales 
ye Sones Seagate ensenel 


1+ 2tB (_ntl ya 


Ae OGs all 


(a.13)

The numerical results of equation (a.13) are in Tables A1
through A6 for Data Sets 1 and 2. The graphical results are
shown in Figures A1 through A6. The range of the estimated
number of faults to occur in (t_e, t_e+t_a) is much smaller than
that of the bootstrap results discussed in Chapter III. None
of the ranges (estimated number of faults to occur) obtained
using the Bayesian method contains the observed number of
faults. A possible explanation for this is inappropriate
values for α and β (α=β=0). After various projects have been
analyzed with software reliability models, fault distributions
may become more apparent. This information can then be
incorporated into reliability models. I feel that, despite the
surprising initial results, this method does promise to be a
useful tool to the software manager.

TABLE A2
BAYESIAN ESTIMATE OF PARAMETERS FOR DATA SET 1
t_e=1000, t_a=500 (CPU MINUTES)
Observed number of bugs in t_a is 14

TABLE A4
BAYESIAN ESTIMATE OF PARAMETERS FOR DATA SET 2
t_e=800, t_a=300 (CPU SECONDS)
90% Confidence Interval